TransformationType#

The TransformationType class is responsible for handling various types of transformations on input sequences, particularly for Chinese text. It provides functionality to identify and categorize different types of character transformations, such as similar shapes, similar pronunciations, and common confusions in Chinese characters.

Key Features#

  • Handles multiple types of character transformations

  • Supports both character-level and byte-level processing

  • Utilizes various dictionaries for efficient lookup of similar characters, pronunciations, and shapes

  • Configurable priority order for different distortion types

Usage#

Here’s a basic example of how to use the TransformationType class:

from lmcsc.transformation_type import TransformationType

# Initialize the TransformationType with a vocabulary
vocab = {...}  # Your vocabulary dictionary
transformer = TransformationType(vocab, is_bytes_level=False)

# Get transformation types for a sequence
observed_sequence = "你好"
transformations, _ = transformer.get_transformation_type(observed_sequence)
print(transformations)

Configuration#

The TransformationType class can be configured using a YAML configuration file. The default configuration file is located at configs/default_config.yaml. You can specify a custom configuration file path when initializing the class:

transformer = TransformationType(vocab, is_bytes_level=False, config_path="path/to/custom_config.yaml")
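The configuration file lists paths to the dictionary resources the class loads. A hypothetical sketch of its layout, with key names inferred from the class attributes (the actual keys and paths in default_config.yaml may differ):

```yaml
# Hypothetical layout; actual key names and paths may differ.
similar_shape_dict: data/similar_shape.json
shape_confusion_dict: data/shape_confusion.json
similar_consonant_dict: data/similar_consonant.json
similar_vowel_dict: data/similar_vowel.json
spell_distance_matrix: data/spell_distance.pkl
prone_to_confusion_dict: data/prone_to_confusion.json
```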

Transformation Types#

The class identifies several types of transformations:

  • IDT: Identical character (no transformation)

  • PTC: Prone to confusion (commonly confused characters)

  • SAP: Same pinyin (characters that share the same pinyin)

  • SIP: Similar pinyin (characters with similar pinyin)

  • SIS: Similar shape (characters with similar visual appearance)

  • OTP: Other pinyin error (pinyin-related errors not covered by SAP or SIP)

  • OTS: Other similar shape (shape-related errors not covered by SIS)

  • MIS: Missing characters (characters missing from the observed sequence)

  • RED: Redundant characters (extra characters not needed in the observed sequence)

  • UNR: Unrecognized transformation (no known transformation type)
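These codes appear as tuples in the output of get_transformation_type. A small helper (illustrative, not part of the library) can expand them into readable labels:

```python
# Code/description table mirroring the character-level types listed above.
TYPE_DESCRIPTIONS = {
    "IDT": "identical character",
    "PTC": "prone to confusion",
    "SAP": "same pinyin",
    "SIP": "similar pinyin",
    "SIS": "similar shape",
    "OTP": "other pinyin error",
    "OTS": "other similar shape",
    "UNR": "unrecognized transformation",
}

def describe(codes):
    """Expand a tuple of codes, e.g. ('IDT', 'OTP'), into readable labels."""
    return ", ".join(TYPE_DESCRIPTIONS[c] for c in codes)

print(describe(("IDT", "OTP")))  # identical character, other pinyin error
```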

Advanced Usage#

The TransformationType class provides several advanced features:

  1. Handling of Out-of-Vocabulary (OOV) characters

  2. Building inverse indices for efficient lookup

  3. Customizable distortion type priorities

  4. Support for both character-level and byte-level processing

For more details on these advanced features, please refer to the class documentation.
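To illustrate what the character-level vs. byte-level distinction means for Chinese text: a two-character string occupies six UTF-8 bytes, so byte-level processing operates on three times as many units per character.

```python
# The same sequence viewed at character level and at byte level.
text = "你好"
char_units = list(text)                  # ['你', '好']
byte_units = list(text.encode("utf-8"))  # 6 UTF-8 bytes, 3 per character
print(len(char_units), len(byte_units))  # 2 6
```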

API Documentation#

class lmcsc.transformation_type.TransformationType(vocab, is_bytes_level, distortion_type_prior_priority=None, config_path='configs/default_config.yaml')[source]#

Bases: object

A class for handling various types of transformations on input sequences, particularly for Chinese text.

This class provides functionality to identify and categorize different types of character transformations, such as similar shapes, similar pronunciations, and common confusions in Chinese characters.

Parameters:
  • vocab (dict) – A dictionary mapping tokens to their indices in the vocabulary.

  • is_bytes_level (bool) – Flag indicating whether the input is at the byte level.

  • distortion_type_prior_priority (list, optional) – A list specifying the priority order of distortion types. If not provided, a default order is used.

  • config_path (str, optional, defaults to ‘configs/default_config.yaml’) – Path to the configuration file containing paths to various dictionaries and resources.

similar_shape_dict#

Dictionary of characters with similar shapes.

Type:

dict

shape_confusion_dict#

Dictionary of characters prone to shape-based confusion.

Type:

dict

similar_consonant_dict#

Dictionary of similar consonants in pinyin.

Type:

dict

similar_vowel_dict#

Dictionary of similar vowels in pinyin.

Type:

dict

similar_spell_dict#

Dictionary of characters with similar spellings.

Type:

dict

near_spell_dict#

Dictionary of characters with near spellings.

Type:

dict

prone_to_confusion_dict#

Dictionary of characters prone to confusion.

Type:

dict

vocab#

The input vocabulary.

Type:

dict

is_bytes_level#

Flag indicating byte-level processing.

Type:

bool

distortion_type_priority_order#

Ordered list of distortion type priorities.

Type:

list

distortion_type_priority#

Dictionary mapping distortion types to their priorities.

Type:

dict

Note

This class relies on various external resources and dictionaries for Chinese language processing, which should be properly configured in the specified config file.

build_distortion_type_priority(distortion_type_prior_priority)[source]#

Builds the distortion type priority.

Parameters:

distortion_type_prior_priority (list, optional) – A list specifying the priority order of distortion types. If not provided, a default order is used.
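A minimal sketch of what such a priority builder might look like, assuming a lower index means a higher priority (the default order shown is illustrative, not the library's actual default):

```python
# Illustrative default; the real default order lives inside the class.
DEFAULT_ORDER = ["IDT", "PTC", "SAP", "SIP", "SIS",
                 "OTP", "OTS", "MIS", "RED", "UNR"]

def build_distortion_type_priority(prior_priority=None):
    order = prior_priority if prior_priority is not None else DEFAULT_ORDER
    # Map each distortion type to its rank; lower rank = higher priority.
    return {dtype: rank for rank, dtype in enumerate(order)}

priority = build_distortion_type_priority()
print(priority["IDT"] < priority["UNR"])  # True: identical beats unrecognized
```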

load_dict(file_name)[source]#

Loads and returns a JSON dictionary from a file.

Parameters:

file_name (str or list) – The name of the file to load the dictionary from.

Returns:

The loaded dictionary.

Return type:

dict
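Since file_name may be a string or a list, a stdlib-only sketch of the documented behavior could look as follows, assuming that a list of file names is merged into one dictionary with later files taking precedence:

```python
import json

def load_dict(file_name):
    """Load a JSON dict from one file, or merge several when given a list."""
    if isinstance(file_name, list):
        merged = {}
        for name in file_name:
            with open(name, encoding="utf-8") as f:
                merged.update(json.load(f))  # later files override earlier ones
        return merged
    with open(file_name, encoding="utf-8") as f:
        return json.load(f)
```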

load_list(file_name)[source]#

Loads and returns a list from a JSON file.

load_similar_spell_dict(file_name)[source]#

Loads the spell distance matrix from a pickle file.

Parameters:

file_name (str) – The name of the file to load the spell distance matrix from.

Returns:

A tuple containing two dictionaries: similar_spell_dict and near_spell_dict.

Return type:

tuple
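The split into the two returned dictionaries could work roughly as follows, assuming the pickled matrix maps pinyin pairs to edit distances; the distance thresholds shown are illustrative assumptions:

```python
def split_spell_matrix(distance_matrix, similar_max=1, near_max=2):
    """Bucket pinyin pairs into 'similar' and 'near' by distance."""
    similar_spell_dict, near_spell_dict = {}, {}
    for (a, b), dist in distance_matrix.items():
        if dist <= similar_max:
            similar_spell_dict.setdefault(a, set()).add(b)
        elif dist <= near_max:
            near_spell_dict.setdefault(a, set()).add(b)
    return similar_spell_dict, near_spell_dict

matrix = {("ma", "mo"): 1, ("ma", "me"): 2, ("ma", "xi"): 5}
similar, near = split_spell_matrix(matrix)
print(similar, near)  # {'ma': {'mo'}} {'ma': {'me'}}
```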

bag_of_chars_hash(token)[source]#

init_pinyin_of_token_hash(token_consonants)[source]#

build_inverse_index()[source]#

Builds inverse indices for efficient lookup.

This method constructs multiple index dictionaries that map certain features of tokens (such as pinyin, character positions, etc.) to the indices of tokens in the vocabulary. These indices are used to efficiently perform lookups based on various transformation types, such as identical characters, similar pinyin, similar shapes, etc.

It processes each token in the vocabulary and builds indices for:

  • Identical characters at specific positions (identical_char_index)

  • Characters prone to confusion at specific positions (prone_to_confusion_char_index)

  • Tokens sharing the same pinyin at specific positions (same_pinyin_index)

  • Tokens with similar pinyin at specific positions (similar_pinyin_index)

  • Tokens with pinyin that are similar due to spelling errors at specific positions (other_similar_pinyin_index)

  • Tokens with characters of similar shapes at specific positions (similar_shape_index)

  • Tokens with characters of shapes that are confused at specific positions (other_similar_shape_index)

  • Identical tokens (identical_token_index)

Additionally, it keeps track of:

  • Token lengths (token_length)

  • Mapping of indices back to tokens (idx_to_token)

  • Set of all unique characters in tokens (char_set)

These indices facilitate quick retrieval of tokens based on various linguistic and orthographic features.
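For example, the identical-character index can be sketched as a mapping from (position, character) pairs to vocabulary indices; the function and variable names here are illustrative, not the class's actual attributes:

```python
from collections import defaultdict

def build_identical_char_index(vocab):
    """Map (position, char) -> set of token indices with that char there."""
    index = defaultdict(set)
    for token, idx in vocab.items():
        for pos, char in enumerate(token):
            index[(pos, char)].add(idx)
    return index

vocab = {"你好": 0, "你们": 1, "大家": 2}
index = build_identical_char_index(vocab)
print(sorted(index[(0, "你")]))  # [0, 1]: both tokens start with 你
```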

handle_oov_characters(observed_sequence)[source]#

Handles out-of-vocabulary (OOV) characters.

Parameters:

observed_sequence (str or bytes) – The observed sequence containing OOV characters.

Returns:

A dictionary mapping token indices to their corresponding transformation types.

Return type:

dict

get_pinyin_data(observed_sequence)[source]#

Gets pinyin data for the observed sequence.

Parameters:

observed_sequence (str) – The observed sequence.

Returns:

A tuple containing two elements:
  • list: A list of pinyin representations for the observed sequence.

  • list: A list of consonant representations for the observed sequence.

Return type:

tuple
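A toy sketch of this return shape, using a hardcoded pinyin table; the real implementation would use a pinyin library plus the consonant/vowel dictionaries from the config:

```python
# Toy pinyin table and initial set for illustration only.
TOY_PINYIN = {"你": "ni", "好": "hao"}
INITIALS = {"b", "p", "m", "f", "d", "t", "n", "l", "g", "k", "h",
            "j", "q", "x", "zh", "ch", "sh", "r", "z", "c", "s", "y", "w"}

def get_pinyin_data(observed_sequence):
    """Return (pinyins, consonants) for each character in the sequence."""
    pinyins = [TOY_PINYIN.get(ch, "") for ch in observed_sequence]
    consonants = []
    for py in pinyins:
        # Two-letter initials (zh/ch/sh) take precedence over single letters.
        if py[:2] in INITIALS:
            consonants.append(py[:2])
        elif py[:1] in INITIALS:
            consonants.append(py[:1])
        else:
            consonants.append("")
    return pinyins, consonants

print(get_pinyin_data("你好"))  # (['ni', 'hao'], ['n', 'h'])
```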

handle_identical_characters(i, char, token_transformation)[source]#

Handles identical characters.

Parameters:
  • i (int) – The index of the character in the observed sequence.

  • char (str) – The character in the observed sequence.

  • token_transformation (dict) – A dictionary mapping token indices to their corresponding transformation types.

handle_prone_to_confusion(i, char, token_transformation)[source]#

Handles characters prone to confusion.

Parameters:
  • i (int) – The index of the character in the observed sequence.

  • char (str) – The character in the observed sequence.

  • token_transformation (dict) – A dictionary mapping token indices to their corresponding transformation types.

is_punctuation_or_space(char)[source]#

Checks whether a character is punctuation or a space.

Parameters:

char (str) – The character to check.

Returns:

True if the character is punctuation or a space, False otherwise.

Return type:

bool
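A stdlib approximation of this check; the class may additionally treat symbols or a custom character list as punctuation:

```python
import unicodedata

def is_punctuation_or_space(char):
    """True for whitespace or any Unicode punctuation (category P*)."""
    return char.isspace() or unicodedata.category(char).startswith("P")

print(is_punctuation_or_space("，"), is_punctuation_or_space("你"))  # True False
```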

handle_continuous_punctuation_or_space(i, observed_sequence, token_transformation)[source]#

Handles continuous punctuation or space.

handle_redundant_before_punctuation_or_space(i, char, observed_sequence, token_transformation, original_token_length)[source]#

Handles redundant characters before punctuation or space.

handle_same_pinyin(i, token_pinyins, token_transformation)[source]#

Handles characters with the same pinyin.

Parameters:
  • i (int) – The index of the character in the observed sequence.

  • token_pinyins (list) – A list of pinyin representations for the observed sequence.

  • token_transformation (dict) – A dictionary mapping token indices to their corresponding transformation types.

handle_reorder_tokens(i, part_observed_sequence, token_transformation)[source]#

Handles reordered tokens.

Parameters:
  • i (int) – The index of the character in the observed sequence.

  • part_observed_sequence (str) – The observed sequence without the character at index i.

  • token_transformation (dict) – A dictionary mapping token indices to their corresponding transformation types.

handle_initial_pinyin_match(i, part_consonants, token_transformation)[source]#

Handles initial pinyin match. For example, “jq” -> “机器”, “精确”, …

Parameters:
  • i (int) – The index of the character in the observed sequence.

  • part_consonants (list) – A list of consonant representations for the observed sequence.

  • token_transformation (dict) – A dictionary mapping token indices to their corresponding transformation types.

handle_similar_pinyin(i, token_pinyins, token_transformation)[source]#

Handles characters with similar pinyin.

Parameters:
  • i (int) – The index of the character in the observed sequence.

  • token_pinyins (list) – A list of pinyin representations for the observed sequence.

  • token_transformation (dict) – A dictionary mapping token indices to their corresponding transformation types.

handle_similar_shape(i, char, token_transformation)[source]#

Handles characters with similar shapes.

Parameters:
  • i (int) – The index of the character in the observed sequence.

  • char (str) – The character in the observed sequence.

  • token_transformation (dict) – A dictionary mapping token indices to their corresponding transformation types.

handle_other_pinyin_error(i, token_pinyins, token_transformation)[source]#

Handles other pinyin errors.

Parameters:
  • i (int) – The index of the character in the observed sequence.

  • token_pinyins (list) – A list of pinyin representations for the observed sequence.

  • token_transformation (dict) – A dictionary mapping token indices to their corresponding transformation types.

handle_redundant_character_inside_token(i, part_observed_sequence, token_transformation, original_token_length)[source]#

Handles redundant characters inside the token.

Parameters:
  • i (int) – The index of the character in the observed sequence.

  • part_observed_sequence (str) – The observed sequence without the character at index i.

  • token_transformation (dict) – A dictionary mapping token indices to their corresponding transformation types.

  • original_token_length (dict) – A dictionary mapping token indices to their original lengths.

handle_redundant_characters(observed_sequence, token_transformation, original_token_length)[source]#

handle_other_similar_shape(i, char, token_transformation)[source]#

Handles characters with other similar shapes.

Parameters:
  • i (int) – The index of the character in the observed sequence.

  • char (str) – The character in the observed sequence.

  • token_transformation (dict) – A dictionary mapping token indices to their corresponding transformation types.

handle_missing_characters(observed_sequence, broken_token_transformation, original_token_length_for_broken)[source]#

Handles missing characters.

Parameters:
  • observed_sequence (str) – The observed sequence.

  • broken_token_transformation (set) – A set of token indices with missing characters.

  • original_token_length_for_broken (dict) – A dictionary mapping token indices to their original lengths.

filter_and_finalize_transformations(token_transformation, broken_token_transformation, original_token_length, original_token_length_for_broken)[source]#

Filters and finalizes the transformations.

Parameters:
  • token_transformation (dict) – A dictionary mapping token indices to their corresponding transformation types.

  • broken_token_transformation (set) – A set of token indices with missing characters.

  • original_token_length (dict) – A dictionary mapping token indices to their original lengths.

  • original_token_length_for_broken (dict) – A dictionary mapping token indices with missing characters to their original lengths.

Returns:

A dictionary mapping token indices to their finalized transformation types.

Return type:

dict
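One way the priority-based finalization could look, assuming each token's candidate types are deduplicated and ordered by priority; the ordering shown is an illustrative assumption:

```python
# Illustrative priority ranks; lower rank = higher priority.
PRIORITY = {"IDT": 0, "PTC": 1, "SAP": 2, "SIP": 3, "SIS": 4,
            "OTP": 5, "OTS": 6, "MIS": 7, "RED": 8, "UNR": 9}

def finalize(token_transformation):
    """Deduplicate each token's types and sort them by priority."""
    out = {}
    for idx, types in token_transformation.items():
        out[idx] = tuple(sorted(set(types), key=PRIORITY.__getitem__))
    return out

print(finalize({7: ["OTP", "IDT", "OTP"]}))  # {7: ('IDT', 'OTP')}
```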

handle_final_oov(observed_sequence)[source]#

Handles out-of-vocabulary (OOV) characters as a final step.

Parameters:

observed_sequence (str) – The observed sequence containing OOV characters.

Returns:

A dictionary mapping token indices to their corresponding transformation types.

Return type:

dict

get_transformation_type(observed_sequence: str)[source]#

Determines the transformation types from each token in the vocabulary to the observed sequence.

This method analyzes the input sequence and identifies various types of character transformations that may have occurred, such as character substitutions, pinyin-based errors, or shape-based confusions. It returns a mapping of token indices to their corresponding transformation types.

Parameters:

observed_sequence (str) – The input sequence of characters to be analyzed for transformations.

Returns:

A tuple containing two elements:
  • A dictionary mapping token indices to a tuple of their corresponding transformation types.

  • A dictionary of the original token lengths (currently empty in this implementation).

Return type:

Tuple[Dict[int, Tuple[str]], Dict[int, int]]

Transformation Types:
  • IDT: Identical character (no transformation).

  • PTC: Prone to confusion (commonly confused characters).

  • SAP: Same pinyin (characters that share the same pinyin).

  • SIP: Similar pinyin (characters with similar pinyin).

  • SIS: Similar shape (characters with similar visual appearance).

  • OTP: Other pinyin error (pinyin-related errors not covered by SAP or SIP).

  • OTS: Other similar shape (shape-related errors not covered by SIS).

  • MIS: Missing characters (characters that are missing from the observed sequence).

  • RED: Redundant characters (characters that are not needed in the observed sequence).

  • UNR: Unrecognized transformation (no known transformation type).

Example

>>> transformer = TransformationType(vocab, is_bytes_level=False)
>>> transformations, _ = transformer.get_transformation_type("你好")
>>> print(transformations)
{36371: ('IDT', 'OTP'), 8225: ('IDT', 'UNR'), ...}

Note

  • This method relies on prior methods such as get_pinyin_data, handle_identical_characters, and various handlers for specific distortion types.

  • The distortion_type_priority_order attribute determines the order in which distortion handlers are applied.